{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# 第9章 EM算法及其推广"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 习题9.1\n",
    "\n",
    "&emsp;&emsp;如例9.1的三硬币模型，假设观测数据不变，试选择不同的初值，例如，$\\pi^{(0)}=0.46,p^{(0)}=0.55,q^{(0)}=0.67$，求模型参数为$\\theta=(\\pi,p,q)$的极大似然估计。  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "**解答：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：**\n",
    "1. 列出例9.1的三硬币模型；\n",
    "2. 写出三硬币模型的EM算法；\n",
    "3. 根据上述EM算法，编写代码，并求出模型参数的极大似然估计。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：例9.1的三硬币模型**\n",
    "\n",
    "&emsp;&emsp;根据书中第175页例9.1（三硬币模型）：\n",
    "> &emsp;&emsp;假设有3枚硬币，分别记作A，B，C。这些硬币正面出现的概率分别是$\\pi$，$p$和$q$。进行如下掷硬币试验：先掷硬币A，根据其结果选出硬币B或硬币C，正面选硬币B，反面选硬币C；然后掷选出的硬币，掷硬币的结果，出现正面记作1，出现方面记作0；独立地重复$n$次试验（这里，$n=10$），观测结果如下：\n",
    "> $$\n",
    "1,1,0,1,0,0,1,0,1,1\n",
    "$$\n",
    "> 假设只能观测到掷硬币的结果，不能观测掷硬币的过程。  \n",
    "> \n",
    "> &emsp;&emsp;三硬币模型可以写作\n",
    "> $$\n",
    "\\begin{aligned}\n",
    "P(y|\\theta) &= \\sum_z P(y, z | \\theta) = \\sum_z P(z|\\theta) P(y | z, \\theta) \\\\\n",
    "&= \\pi p^y (1-p)^{1-y} + (1 - \\pi) q^y (1- q)^{1-y}\n",
    "\\end{aligned}\n",
    "$$\n",
    "> 这里：\n",
    "> 1. 随机变量$y$是观测变量，表示一次试验观测的结果是1或0；\n",
    "> 2. 随机变量$z$是隐变量，表示未观测到的掷硬币A的结果；\n",
    "> 3. $\\theta=(\\pi, p, q)$是模型参数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：三硬币模型的EM算法**\n",
    "\n",
    "&emsp;&emsp;根据书中第176页三硬币模型的EM算法：\n",
    "> &emsp;&emsp;EM算法首先选取参数的初值，记作$\\theta^{(0)}=(\\pi^{(0)}, p^{(0)}, q^{(0)})$，然后通过下面的步骤迭代计算参数的估计值，直至收敛为止。第$i$次迭代参数的估计值为$\\theta^{(i)}=(\\pi^{(i)}, p^{(i)}, q^{(i)})$。EM算法的第$i+1$次迭代如下：\n",
    "> \n",
    "> &emsp;&emsp;E步：计算在模型参数$\\pi^{(i)}, p^{(i)}, q^{(i)}$下观测数据$y_j$来自掷硬币B的概率\n",
    "> $$\n",
    "\\mu_j^{(i+1)} = \\frac{\\pi^{(i)} (p^{(i)})^{y_j} (1-p^{(i)})^{1-y_j}}{\\pi^{(i)} (p^{(i)})^{y_j} (1-p^{(i)})^{1-y_j} + (1-\\pi^{(i)}) (q^{(i)})^{y_j} (1-q^{(i)})^{1-y_j}}\n",
    "$$\n",
    "> &emsp;&emsp;M步：计算模型参数的新估计值\n",
    "> $$\n",
    "\\pi^{(i+1)} = \\frac{1}{n} \\sum_{j=1}^N \\mu_j^{(i+1)} \\\\\n",
    "p^{(i+1)} = \\frac{ \\displaystyle \\sum_{j=1}^n \\mu_j^{(i+1)} y_j }{ \\displaystyle \\sum_{j=1}^n \\mu_j^{(i+1)} } \\\\\n",
    "q^{(i+1)} = \\frac{ \\displaystyle \\sum_{j=1}^n ( 1 - \\mu_j^{(i+1)} ) y_j }{ \\displaystyle \\sum_{j=1}^n ( 1 - \\mu_j^{(i+1)} ) }\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：编写代码并求出模型参数的极大似然估计**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "\n",
    "class ThreeCoinEM:\n",
    "    def __init__(self, prob, tol=1e-6, max_iter=1000):\n",
    "        \"\"\"\n",
    "        初始化模型参数\n",
    "        :param prob: 模型参数的初值\n",
    "        :param tol: 收敛阈值\n",
    "        :param max_iter: 最大迭代次数\n",
    "        \"\"\"\n",
    "        self.prob_A, self.prob_B, self.prob_C = prob\n",
    "        self.tol = tol\n",
    "        self.max_iter = max_iter\n",
    "\n",
    "    def calc_mu(self, j):\n",
    "        \"\"\"\n",
    "        （E步）计算mu\n",
    "        :param j: 观测数据y的第j个\n",
    "        :return: 在模型参数下观测数据yj来自掷硬币B的概率\n",
    "        \"\"\"\n",
    "        # 掷硬币A观测结果为正面\n",
    "        pro_1 = self.prob_A * \\\n",
    "            math.pow(self.prob_B, data[j]) * \\\n",
    "            math.pow((1 - self.prob_B), 1 - data[j])\n",
    "        # 掷硬币A观测结果为反面\n",
    "        pro_2 = (1 - self.prob_A) * math.pow(self.prob_C,\n",
    "                                             data[j]) * math.pow((1 - self.prob_C), 1 - data[j])\n",
    "        return pro_1 / (pro_1 + pro_2)\n",
    "\n",
    "    def fit(self, data):\n",
    "        count = len(data)\n",
    "        print(\"模型参数的初值：\")\n",
    "        print(\"prob_A={}, prob_B={}, prob_C={}\".format(\n",
    "            self.prob_A, self.prob_B, self.prob_C))\n",
    "        print(\"EM算法训练过程：\")\n",
    "        for i in range(self.max_iter):\n",
    "            # （E步）得到在模型参数下观测数据yj来自掷硬币B的概率\n",
    "            _mu = [self.calc_mu(j) for j in range(count)]\n",
    "            # （M步）计算模型参数的新估计值\n",
    "            prob_A = 1 / count * sum(_mu)\n",
    "            prob_B = sum([_mu[k] * data[k] for k in range(count)]) \\\n",
    "                / sum([_mu[k] for k in range(count)])\n",
    "            prob_C = sum([(1 - _mu[k]) * data[k] for k in range(count)]) \\\n",
    "                / sum([(1 - _mu[k]) for k in range(count)])\n",
    "            print('第{}次：prob_A={:.4f}, prob_B={:.4f}, prob_C={:.4f}'.format(\n",
    "                i + 1, prob_A, prob_B, prob_C))\n",
    "            # 计算误差值\n",
    "            error = abs(self.prob_A - prob_A) + \\\n",
    "                abs(self.prob_B - prob_B) + abs(self.prob_C - prob_C)\n",
    "            self.prob_A = prob_A\n",
    "            self.prob_B = prob_B\n",
    "            self.prob_C = prob_C\n",
    "            # 判断是否收敛\n",
    "            if error < self.tol:\n",
    "                print(\"模型参数的极大似然估计：\")\n",
    "                print(\"prob_A={:.4f}, prob_B={:.4f}, prob_C={:.4f}\".format(self.prob_A, self.prob_B,\n",
    "                                                                           self.prob_C))\n",
    "                break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "模型参数的初值：\n",
      "prob_A=0.46, prob_B=0.55, prob_C=0.67\n",
      "EM算法训练过程：\n",
      "第1次：prob_A=0.4619, prob_B=0.5346, prob_C=0.6561\n",
      "第2次：prob_A=0.4619, prob_B=0.5346, prob_C=0.6561\n",
      "模型参数的极大似然估计：\n",
      "prob_A=0.4619, prob_B=0.5346, prob_C=0.6561\n"
     ]
    }
   ],
   "source": [
    "# 加载数据\n",
    "data = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]\n",
    "# 模型参数的初值\n",
    "init_prob = [0.46, 0.55, 0.67]\n",
    "\n",
    "# 三硬币模型的EM模型\n",
    "em = ThreeCoinEM(prob=init_prob, tol=1e-5, max_iter=100)\n",
    "# 模型训练\n",
    "em.fit(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;可见通过两次迭代，模型参数已经收敛，三硬币正面出现的概率分别为0.4619，0.5346，0.6561"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 习题9.2\n",
    "证明引理9.2。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：**\n",
    "1. 写出需要证明的引理9.2；\n",
    "2. 列出$F$函数定义；\n",
    "3. 根据引理9.1，进行公式推导；\n",
    "4. 根据约束条件$\\displaystyle \\sum_z \\tilde{P}_{\\theta}(Z) = 1$，可证明引理9.2。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：需要证明的引理9.2**\n",
    "\n",
    "&emsp;&emsp;根据书中第188页引理9.2：\n",
    "> 若$\\tilde{P}_{\\theta}(Z)=P(Z | Y, \\theta)$，则\n",
    "> $$\n",
    "F(\\tilde{P}, \\theta)=\\log P(Y|\\theta)\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：$F$函数定义**\n",
    "\n",
    "&emsp;&emsp;根据书中第187页$F$函数定义：\n",
    "> &emsp;&emsp;假设隐变量数据$Z$的概率分布为$\\tilde{P}(Z)$，定义分布$\\tilde{P}$与参数$\\theta$的函数$F(\\tilde{P}, \\theta)$如下：\n",
    "> \n",
    "> $$\n",
    "F(\\tilde{P}, \\theta) = E_{\\tilde{P}}[\\log P(Y, Z|\\theta)] + H(\\tilde{P})\n",
    "$$\n",
    "> \n",
    "> 称为$F$函数。式中$H(\\tilde{P}) = - E_{\\tilde{P}} \\log \\tilde{P}(Z)$是分布$\\tilde{P}(Z)$的熵。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：引理9.1**\n",
    "\n",
    "&emsp;&emsp;根据书中第187页引理9.1：\n",
    "> 对于固定的$\\theta$，存在唯一的分布$\\tilde{P}_{\\theta}$极大化$F(\\tilde{P}, \\theta)$，这时$\\tilde{P}_{\\theta}$由下式给出：\n",
    "> $$\n",
    "\\tilde{P}_{\\theta}(Z) = P(Z | Y, \\theta)\n",
    "$$\n",
    "> 并且$\\tilde{P}_{\\theta}$随$\\theta$连续变化。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\begin{aligned}\n",
    "\\therefore F(\\tilde{P}, \\theta) \n",
    "&= E_{\\tilde{P}}[\\log P(Y, Z|\\theta)] + H(\\tilde{P}) \\\\\n",
    "&= E_{\\tilde{P}}[\\log P(Y,Z|\\theta)] -E_{\\tilde{P}} \\log \\tilde{P}(Z) \\quad （F函数定义：H(\\tilde{P}) = - E_{\\tilde{P}} \\log \\tilde{P}(Z)）\\\\\n",
    "&= \\sum_Z \\log P(Y,Z|\\theta) \\tilde{P}_{\\theta}(Z) - \\sum_Z \\log \\tilde{P}(Z) \\cdot \\tilde{P}(Z)\n",
    "\\end{aligned}$ "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "根据引理9.1：$\\tilde{P}_{\\theta}(Z) = P(Z | Y, \\theta)$\n",
    "\n",
    "$\\begin{aligned}\n",
    "F(\\tilde{P}, \\theta)\n",
    "&= \\sum_Z \\log P(Y,Z|\\theta) \\tilde{P}_{\\theta}(Z) - \\sum_Z \\log \\tilde{P}(Z) \\cdot \\tilde{P}(Z) \\\\\n",
    "&= \\sum_Z \\log P(Y,Z|\\theta) P(Z|Y,\\theta) -  \\sum_Z \\log P(Z|Y,\\theta) \\cdot P(Z|Y,\\theta) \\\\\n",
    "&= \\sum_Z P(Z|Y,\\theta) \\left[ \\log P(Y,Z|\\theta) - \\log P(Z|Y,\\theta) \\right] \\\\\n",
    "&= \\sum_Z P(Z|Y,\\theta) \\log \\frac{P(Y,Z|\\theta)}{P(Z|Y,\\theta)} \\\\\n",
    "&= \\sum_Z P(Z|Y,\\theta) \\log P(Y|\\theta) \\\\\n",
    "&= \\log P(Y|\\theta) \\sum_Z P(Z|Y,\\theta) \n",
    "\\end{aligned}$  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第4步：根据引理9.1，得证**\n",
    "\n",
    "根据引理9.1，可知：$\\displaystyle \\sum_Z P(Z|Y, \\theta) = \\sum_Z \\tilde{P}_{\\theta}(Z) = 1$  \n",
    "\n",
    "$\\therefore F(\\tilde{P}, \\theta) = \\log P(Y|\\theta)$，引理9.2得证。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 习题9.3\n",
    "已知观测数据  \n",
    "-67，-48，6，8，14，16，23，24，28，29，41，49，56，60，75  \n",
    "试估计两个分量的高斯混合模型的5个参数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：**\n",
    "\n",
    "&emsp;&emsp;两个分量的高斯混合模型一共有6个参数$\\mu_1, \\mu_2, \\sigma_1, \\sigma_2, \\alpha_1, \\alpha_2$，其中$\\alpha_2$可由$\\alpha_2 = 1- \\alpha_1$得到，故仅估计5个参数即可。\n",
    "1. 写出高斯混合模型；\n",
    "2. 写出高斯混合模型参数估计的EM算法；\n",
    "3. 采用sklearn的GaussianMixture计算6个参数；\n",
    "4. 采用自编程实现高斯混合模型的EM算法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：高斯混合模型**\n",
    "\n",
    "&emsp;&emsp;根据书中第183页高斯混合模型：\n",
    "> 高斯混合模型是指具有如下形式的概率分布模型：\n",
    "> $$\n",
    "P(y | \\theta) = \\sum_{k=1}^K \\alpha_k \\phi(y|\\theta_k)\n",
    "$$\n",
    "> 其中，$\\alpha_k$是系数，$\\alpha_k \\geqslant 0$，$\\displaystyle \\sum_{k=1}^K \\alpha_k = 1$；$\\phi(y|\\theta)$是高斯分布密度，$\\theta_k=(u_k, \\sigma_k^2)$，\n",
    "> $$\n",
    "\\phi(y|\\theta_k) = \\frac{1}{\\sqrt{2 \\pi} \\sigma_k} \\exp \\left( -\\frac{(y - \\mu_k)^2}{ 2 \\sigma_k^2} \\right)\n",
    "$$\n",
    "> 称为第$k$个分模型。\n",
    "\n",
    "从上述描述中可知，如果是2个高斯混合分模型，一共需要估计的参数有6个$\\mu_1, \\mu_2, \\sigma_1, \\sigma_2, \\alpha_1, \\alpha_2$，其中$\\alpha_1 + \\alpha_2 = 1$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：高斯混合模型参数估计的EM算法**\n",
    "\n",
    "&emsp;&emsp;根据书中第186页算法9.2：\n",
    "> 输入：观测数据$y_1, y_2, \\cdots, y_N$，高斯混合模型；  \n",
    "输出：高斯混合模型参数。  \n",
    "（1）取参数的初始值开始迭代；  \n",
    "（2）E步：依据当前模型参数，计算分模型$k$对观测数据$y_i$的响应度\n",
    ">$$\n",
    "\\hat{\\gamma}_{jk} = \\frac{\\alpha_k \\phi(y_j | \\theta_k)}{\\displaystyle \\sum_{k=1}^K \\alpha_k \\phi(y_j | \\theta_k)}, \\quad j=1,2,\\cdots,N; \\quad k=1,2,\\cdots,K\n",
    "$$  \n",
    ">（3）M步：计算新一轮迭代的模型参数\n",
    "> $$\n",
    "\\hat{u}_k = \\frac{\\displaystyle \\sum_{j=1}^N \\hat{\\gamma}_{jk} y_j }{\\displaystyle \\sum_{j=1}^N \\hat{\\gamma}_{jk}}, \\quad k=1,2,\\cdots,K \\\\\n",
    "\\hat{\\sigma}_k^2 = \\frac{\\displaystyle \\sum_{j=1}^N \\hat{\\gamma}_{jk} (y_j - u_k)^2 }{\\displaystyle \\sum_{j=1}^N \\hat{\\gamma}_{jk} }, \\quad k=1,2,\\cdots,K \\\\\n",
    "\\hat{\\alpha}_k = \\frac{\\displaystyle \\sum_{j=1}^N \\hat{\\gamma}_{jk}}{N}, \\quad k=1,2,\\cdots,K\n",
    "$$  \n",
    ">（4）重复第（2）步和第（3）步，直到收敛。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：采用sklearn的GaussianMixture计算6个参数**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "分类结果：labels = [0 0 1 1 1 1 1 1 1 1 1 1 1 1 1]\n",
      "\n",
      "两个分量高斯混合模型的6个参数如下：\n",
      "means = [[-57.51107027  32.98489643]]\n",
      "covariances = [[ 90.24987882 429.45764867]]\n",
      "weights =  [[0.13317238 0.86682762]]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.mixture import GaussianMixture\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# 初始化观测数据\n",
    "data = np.array([-67, -48, 6, 8, 14, 16, 23, 24, 28,\n",
    "                29, 41, 49, 56, 60, 75]).reshape(-1, 1)\n",
    "\n",
    "# 设置n_components=2，表示两个分量高斯混合模型\n",
    "gmm_model = GaussianMixture(n_components=2)\n",
    "# 对模型进行参数估计\n",
    "gmm_model.fit(data)\n",
    "# 对数据进行聚类\n",
    "labels = gmm_model.predict(data)\n",
    "\n",
    "# 得到分类结果\n",
    "print(\"分类结果：labels = {}\\n\".format(labels))\n",
    "print(\"两个分量高斯混合模型的6个参数如下：\")\n",
    "# 得到参数u1,u2\n",
    "print(\"means =\", gmm_model.means_.reshape(1, -1))\n",
    "# 得到参数sigma1, sigma1\n",
    "print(\"covariances =\", gmm_model.covariances_.reshape(1, -1))\n",
    "# 得到参数a1, a2\n",
    "print(\"weights = \", gmm_model.weights_.reshape(1, -1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYcAAAEWCAYAAACNJFuYAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy86wFpkAAAACXBIWXMAAAsTAAALEwEAmpwYAAAZ10lEQVR4nO3deZhldX3n8fdHWkE2AbtpkCaCiktLTGtKBMwwRmBEJbbPTNqgop0xeZi4J6OjEDKJZkJCTB4TM1l73FBRQuMCSTSKGMOTiEujLQQQIUGhWZpSA7RL0ILv/HFO67XO7eqCrlunLvf9ep5+zj3LPed7q6vu5/x+Z0tVIUnSoAf1XYAkaekxHCRJHYaDJKnDcJAkdRgOkqQOw0GS1GE46AEpyceSrO9x+z+R5NtJduurhsWW5E1J3jfPZT+d5JdHXZPuP8NBCyLJKUk+l+Q7SW5vX78iSfqop6qeXVXnLPR6k/xikkry1lnTn99Of3e7/Rurau+qumce63x3kt9Z6Frn2N4z2lo/NGv6T7XTP71YtWjpMhy0y5K8Dngb8AfAQcBK4FeApwMP6bG0UflX4BeSLBuY9lLgq30UM6uO+ZoGjk3y8IFp6+npM2jpMRy0S5I8DPht4BVVdUFVbavGl6rqxVV1d7vcc5N8KcldSW5K8qaBdTwjyZZZ6/1akhPa10cl2dS+d+v2vfYkeyR5X5JvJrkjyReSrGzn/bDbIsmjk3yqXe4bSc5Nst+sbb0+yRVJ7kzy10n2mONj3wZcCTyrff8BwLHARQPrPKzdC1+W5IAkW5L8XDtv7yTXJ3lpktOAFwNvaLuh/qZdppI8ZmB9P2xdbP95JXljktuAdyV5UJLTk/xr+znPb+vake8DHwFOade5G/AC4NxZ/w/Htj/XO9vhsQPzDk/yj0m2JbkYWD7rvUcn+Uz7f/PlJM+Yox4tMYaDdtUxwO7AhTtZ7js0e9f7Ac8FXp7k+fPcxtuAt1XVvsCjgfPb6euBhwGHAg+naa18b8j7A/we8AjgCe3yb5q1zAuAk4DDgScBv7iTmt7Tfh5ovmAvBO4etmBVfQt4GfD/khwI/BGwuareU1UbaL6Q39J2Q/3cTra73UHAAcAjgdOA1wDPB/5z+zn/Hfiz+/AZngVcBdyyfWYbLn8H/AnNz/etwN8NtDbeD1xOEwr/h+b/Y/t7D2nf+zttna8HPphkxTw/n3pmOGhXLQe+UVUz2ycM7C1+L8lxAFX16aq6sqruraorgA/QfJHNxw+AxyRZXlXfrqrPDkx/OPCYqrqnqi6vqrtmv7mqrq+qi6vq7qqapvmSm73tP6mqW9ov8r8B1uykpg8Dz2hbTi+l+aLdoar6BLARuIQmHP/HTta/M/cCv9V+pu+16zuzqra0rbU3AT8/V5dTVX0GOCDJ43bwGZ4LXFdV762qmar6APAV4OeS/ATwVOB/tzVcSvNz2+5U4KNV9dH2//xiYBPwnF383FokhoN21TeB5YNfQlV1bFXt1857EECSpyX5hyTTSe6k2ctfPmyFQ/wS8FjgK23Xxsnt9PcCHwfOS3JLkrckefDsNyc5MMl5SW5OchfwviHbvm3g9XeBvecqqP1C/jvgN4DlVfXP8/gcG4AjgXdV1TfnsfxcpqvqPwbGHwl8uA3lO4BrgHtojv/M5b3Aq4CfpQm8QY8Avj5r2teBQ9p5/15V35k1b7CeddvraWv6GeDgnX0wLQ2Gg3bVZTTdKWt3stz7afrkD62qhwF/SdPdA02X057bF2z7v3/Y/VBV11XVC4EDgd8HLkiyV1X9oKreXFWrafr8T+ZH3SSDfg8o4Elt19SpA9veFe8BXkfzBTun9jP9Vfuelw8eT2hrm+27DPxMaLqRBs1+z03As6tqv4F/e1TVzTsp7b3AK2j28r87a94tNF/yg34CuBm4Fdg/yV6z5g3W895Z9exVVWfvpB4tEYaDdklV3QG8GfjzJD/fHmx9UJI1wOAXxz7At6rqP5IcBbxoYN5XgT3ag9YPptkb3337zCSnJllRVfcCd7ST70nys0l+sv3ivYumm2nYqaP7AN8G7mj7wv/Xrn9yAP4ROBH4v/NY9tfb4cuAPwTekx9dA7EVeNSs5TcDL0qyW5KT2HkX3F8CZyV5JECSFUl2FthU1Q3tus8cMvujwGOTvKg9sP4LwGrgb6vq6zTdRG9O8pAkPwMMHi95H03307Paz7BHeyB91c5q0tJgOGiXVdVbgP8JvAG4nebL7q+ANwKfaRd7BfDbSbYBv8mPDipTVXe2899Os1f6HWDw7KWTgKuSfJvm4PQpbZfKQcAFNMFwDc2X9bCLsN4MPAW4k6Yr6ENDlrnP2rOyLmmPU+xQkp+m+fm8tL3u4fdp9vxPbxd5B7C67X75SDvttTRftnfQnM30Eeb2NpqW2Sfan/FngafN83P8U1XdMmT6N2laY6+j6SJ8A3ByVX2jXeRF7Ta+BfwWA8csquommtbkr9OcNnsTTSj7nTMm4sN+JEmzmeKSpA7DQZLUYThIkjoMB0lSx/25YdeCSfJrwC/TnLlxJfDfac7t/mvgMOBrwAuq6t/nWs/y5cvrsMMOG2WpkvSAc/nll3+jqobe0qS3s5Xa883/CVhdVd9Lcj7NedWrac6HPzvJ6cD+VfXGudY1NTVVmzZtGn3RkvQAkuTyqpoaNq/vbqVlwEPbWy/sSXNF5lpg+334z6G5mZgkaRH1Fg7tZf1/CNxIcyn+ne3NyVZW1a3tMrfS3DKhI8lpaW7jvGl6enqxypakidBbOCTZn6aVcDjNTbz2SnLqfN9fVRuqaqqqplas8C7AkrSQ+uxWOgG4oaqmq+oHNLc0OBbYmuRggHZ4e481StJE6jMcbgSOTrJnkgDH09wf5yJ+9NCQ9ez8ITKSpAXW26msVfW5JBcAXwRmgC/R3O9+b+D8JL9EEyDr+qpRkiZVr2crVdVvVdXjq+rIqnpJ+0Spb1bV8VV1RDuc846XkjSJtm6F446Dffdthlu3Luz6+z6VVZJ0P6xbB5ddBtu2NcN1C9zHYjhI0hjavBlm2ie3z8w04wvJcJCkMbRmDSxrjxovW9aMLyTDQZLG0MaNcMwxsM8+zXDjxoVdf6833pMk3T8rV8Kll45u/bYcJEkdhoMkqcNwkCR1GA6SpA7DQZLUYThIkjoMB0lSh+EgSeowHCRJHYaDJKnDcJAkdRgOkqQOw0GS1GE4SJI6eg2HJPsluSDJV5Jck+SYJAckuTjJde1w/z5rlKRJ1HfL4W3A31fV44GfAq4BTgcuqaojgEvacUkaS1u3wnHHwb77NsOtW/uuaH56C4ck+wLHAe8AqKrvV9UdwFrgnHaxc4Dn91GfJC2Edevgsstg27ZmuG5d3xXNT58th0cB08C7knwpyduT7AWsrKpbAdrhgcPenOS0JJuSbJqenl68qiXpPti8GWZmmtczM834OOgzHJYBTwH+oqqeDHyH+9CFVFUbqmqqqqZWrFgxqholaZesWQPL2gcyL1vWjI+DPsNhC7Clqj7Xjl9AExZbkxwM0A5v76k+SdplGzfCMcfAPvs0w40b+65ofpb1teGqui3JTUkeV1XXAscDV7f/1gNnt8ML+6pRknbVypVw6aV9V3Hf9X220quBc5NcAawBfpcmFE5Mch1wYjsuSSM1rmcVjUpvLQeAqtoMTA2ZdfwilyJpwm0/q2hm5kdnFY3jHv9C6bvlIElLwrieVTQqhoMkMb5nFY2K4SBJjO9ZRaPS6zEHSVoqxvWsolGx5SBJ6jAcJEkdhoMkqcNwkDR2vGBt9AwHSWNnXG+DPU4MB0ljxwvWRs9wkDR2vGBt9AwHSWPHC9ZGz4vgJI0dL1gbPVsOkqQOw0GS1GE4SJI6DAdJUofhIEnqMBwkSR29h0OS3ZJ8KcnftuMHJLk4yXXtcP++a5R0/3gPpPHVezgArwWuGRg/Hbikqo4ALmnHJY0h74E0vnoNhySrgOcCbx+YvBY4p319DvD8RS5L0gLxHkjjq++Wwx8DbwDuHZi2sqpuBWiHBw57Y5LTkmxKsml6enrkhUq677wH0vjqLRySnAzcXlWX35/3V9WGqpqqqqkVK1YscHWSFoL3QBpffd5b6enA85I8B9gD2DfJ+4CtSQ6uqluTHAzc3mONknaB90AaX721HKrqjKpaVVWHAacAn6qqU4GLgPXtYuuBC3sqUZImVt/HHIY5GzgxyXXAie24JGkRLYlbdlfVp4FPt6+/CRzfZz2SNOmWYstBktQzw0GS1GE4SJI6DAdJUofhIEnqMBwkSR2GgySpw3CQ5HMX1GE4SPK5C+owHKQxMqo9fJ+7oNkMB2mMjGoP3+cuaDbDQRojo9rD97kLmm1J3HhP0vysWdO0GGZmFnYP3+cuaDZbDtIYcQ9fi8VwkEZkFAePt+/h33VXM1y5ctfXKQ1jOEgj4umhGmeGgyaep4dKXYaDJp6nh0pdhoMmnqeHSl29hUOSQ5P8Q5JrklyV5LXt9AOSXJzkuna4f181ajKMag/fg8caZ322HGaA11XVE4CjgVcmWQ2cDlxSVUcAl7Tj0si4hy919XYRXFXdCtzavt6W5BrgEGAt8Ix2sXOATwNv7KFETQgvAJO6lsQxhySHAU8GPgesbINje4AcuIP3nJZkU5JN09PTi1arJE2C3sMhyd7AB4Ffraq75vu+qtpQVVNVNbVixYrRFShJE6jXcEjyYJpgOLeqPtRO3prk4Hb+wcDtfdUnSZOqz7OVArwDuKaq3jow6yJgfft6PXDhYtempcmnlUmLp8+Ww9OBlwDPTLK5/fcc4GzgxCTXASe245K3o5AWUZ9nK/0TkB3MPn4xa9F48HYU0uLp/YC0HphG0QXk7SikxWM4aCRG0QXkxWrS4vFJcBqJUXQBebGatHhsOWgk7AKSxpvhoJGwC0gab3YraSTsApLGmy0HSVKH4SBJ6jAcJEkdhsOE835FkoYxHCac9yuSNIzhMCZGtYfv/YokDWM4jIlR7eF7sZqkYQyHMTGqPXwvVpM0jBfBjYk1a5oWw8zMwu7he7GapGF22nJI8qok+y9GMdox9/AlLab5tBwOAr6Q5IvAO4GPV1WNtizN5h6+pMW005ZDVf0GcATN855/Ebguye8mefSIa5Mk9WReB6TblsJt7b8ZYH/ggiRvGVVhSU5Kcm2S65OcPqrtSJK65nPM4TVJLgfeAvwz8JNV9XLgp4H/NoqikuwG/BnwbGA18MIkq0exLUlS13yOOSwH/mtVfX1wYlXdm+Tk0ZTFUcD1VfVvAEnOA9YCV49oe5KkATsNh6r6zTnmXbOw5fzQIcBNA+NbgKeNaFuSpFmW6kVwGTLtx86QSnJakk1JNk1PTy9SWZI0GZZqOGwBDh0YXwXcMrhAVW2oqqmqmlqxYsWiFidJD3RLNRy+AByR5PAkDwFOAS7quSZJmhhL8vYZVTWT5FXAx4HdgHdW1VU9lyVJE2NJhgNAVX0U+GjfdUjSJFqq3UqSpB4ZDpKkDsNBktRhOEiSOgwHSVKH4SBJ6jAcJEkdhoMkqcNwkCR1GA6SpA7DQZLUYThIkjoMB0lSh+EgSeowHCRJHYaDJKnDcJAkdRgOkqQOw0GS1NFLOCT5gyRfSXJFkg8n2W9g3hlJrk9ybZJn9VGfJE26vloOFwNHVtWTgK8CZwAkWQ2cAjwROAn48yS79VSjJE2sXsKhqj5RVTPt6GeBVe3rtcB5VXV3Vd0AXA8c1UeNkjTJlsIxh5cBH2tfHwLcNDBvSzutI8lpSTYl2TQ9PT3iEiVpsiwb1YqTfBI4aMisM6vqwnaZM4EZ4NztbxuyfA1bf1VtADYATE1NDV1GknT/jCwcquqEueYnWQ+cDBxfVdu/3LcAhw4stgq4ZTQVSpJ2pK+zlU4C3gg8r6q+OzDrIuCUJLsnORw4Avh8HzVK0iQbWcthJ/4U2B24OAnAZ6vqV6rqqiTnA1fTdDe9sqru6alGSZpYvYRDVT1mjnlnAWctYjmSpFmWwtlKkqQlxnCQJHUYDpKkDsNBktRhOEiSOgwHSVKH4SBJ6jAcJEkdhoMkqcNwkCR1GA6SpA7DQZLUYThIkjoMB0lSh+EgSeowHCRJHYaDJKnDcJAkdRgOkqSOXsMhyeuTVJLlA9POSHJ9kmuTPKvP+iRpUi3ra8NJDgVOBG4cmLYaOAV4IvAI4JNJHltV9/RTpSRNpj5bDn8EvAGogWlrgfOq6u6qugG4Hjiqj+IkaZL1Eg5JngfcXFVfnjXrEOCmgfEt7bRh6zgtyaYkm6anp0dUqSRNppF1KyX5JHDQkFlnAr8O/JdhbxsyrYZMo6o2ABsApqamhi4jSbp/RhYOVXXCsOlJfhI4HPhyEoBVwBeTHEXTUjh0YPFVwC2jqlGSNNyidytV1ZVVdWBVHVZVh9EEwlOq6jbgIuCUJLsnORw4Avj8YtcoSZOut7OVhqmqq5KcD1wNzACv9EwlSVp8vYdD23oYHD8LOKufaiRJ4BXSkqQhDAdJUofhIEnqMBwW2tatcNxxsO++zXDr1r4rkqT7zHBYaOvWwWWXwbZtzXDdur4rkqT7zHBYaJs3w8xM83pmphmXpDFjOCy0NWtgWXuG8LJlzbgkjRnDYaFt3AjHHAP77NMMN27suyJJus96vwjuAWflSrj00r6rkKRdYstBktRhOEiSOgwHSVKH4SBJ6jAcJEkdhoMkqcNwkCR1GA6SpA7DQZLUYThIkjp6C4ckr05ybZKrkrxlYPoZSa5v5z2rr/okaZL1cm+lJD8LrAWeVFV3Jzmwnb4aOAV4IvAI4JNJHltV9/RRpyRNqr5aDi8Hzq6quwGq6vZ2+lrgvKq6u6puAK4HjuqpRkmaWH2Fw2OB/5Tkc0n+MclT2+mHADcNLLelndaR5LQkm5Jsmp6eHnG5kjRZRtatlOSTwEFDZp3Zbnd/4GjgqcD5SR4FZMjyNWz9VbUB2AAwNTU1dBlJ0v0zsnCoqhN2NC/Jy4EPVVUBn09yL7CcpqVw6MCiq4BbRlWjJGm4vrqVPgI8EyDJY4GHAN8ALgJOSbJ7ksOBI4DP91SjJE2svp4E907gnUn+Bfg+sL5tRVyV5HzgamAGeKVnKknS4uslHKrq+8CpO5h3FnDW4lYkSRrkFdKSpA7DQZLUMdnhsHUrHHcc7LtvM9y6te+KJGlJmOxwWLcOLrsMtm1rhuvW9V2RJC0Jkx0OmzfDzEzzemamGZckTXg4rFkDy9oTtpYta8YlSRMeDhs3wjHHwD77NMONG/uuSJKWhL4uglsaVq6ESy/tuwpJWnImu+UgSRrKcJAkdRgOkqQOw0GS1GE4SJI6DAdJUkeaxyiMtyTTwNd3YRXLaR42NA7GqVYYr3qtdXTGqd5xqhV2rd5HVtWKYTMeEOGwq5JsqqqpvuuYj3GqFcarXmsdnXGqd5xqhdHVa7eSJKnDcJAkdRgOjQ19F3AfjFOtMF71WuvojFO941QrjKhejzlIkjpsOUiSOgwHSVLHRIdDkpOSXJvk+iSn913PXJIcmuQfklyT5Kokr+27pp1JsluSLyX5275r2Zkk+yW5IMlX2p/xMX3XtCNJfq39HfiXJB9IskffNQ1K8s4ktyf5l4FpByS5OMl17XD/Pmvcbge1/kH7e3BFkg8n2a/HEn/MsHoH5r0+SSVZvhDbmthwSLIb8GfAs4HVwAuTrO63qjnNAK+rqicARwOvXOL1ArwWuKbvIubpbcDfV9XjgZ9iidad5BDgNcBUVR0J7Aac0m9VHe8GTpo17XTgkqo6ArikHV8K3k231ouBI6vqScBXgTMWu6g5vJtuvSQ5FDgRuHGhNjSx4QAcBVxfVf9WVd8HzgPW9lzTDlXVrVX1xfb1Npovr0P6rWrHkqwCngu8ve9adibJvsBxwDsAqur7VXVHr0XNbRnw0CTLgD2BW3qu58dU1aXAt2ZNXguc074+B3j+Yta0I8NqrapPVFX7cHk+C6xa9MJ2YAc/W4A/At4ALNgZRpMcDocANw2Mb2EJf9kOSnIY8GTgcz2XMpc/pvllvbfnOubjUcA08K62G+ztSfbqu6hhqupm4A9p9hBvBe6sqk/0W9W8rKyqW6HZ0QEO7Lme+XoZ8LG+i5hLkucBN1fVlxdyvZMcDhkybcmf15tkb+CDwK9W1V191zNMkpOB26vq8r5rmadlwFOAv6iqJwPfYel0e/yYtq9+LXA48AhgrySn9lvVA1OSM2m6c8/tu5YdSbIncCbwmwu97kkOhy3AoQPjq1hizfPZkjyYJhjOraoP9V3PHJ4OPC/J12i6656Z5H39ljSnLcCWqtreEruAJiyWohOAG6pquqp+AHwIOLbnmuZja5KDAdrh7T3XM6ck64GTgRfX0r4Y7NE0Owpfbv/eVgFfTHLQrq54ksPhC8ARSQ5P8hCag3oX9VzTDiUJTZ/4NVX11r7rmUtVnVFVq6rqMJqf66eqasnu3VbVbcBNSR7XTjoeuLrHkuZyI3B0kj3b34njWaIHz2e5CFjfvl4PXNhjLXNKchLwRuB5VfXdvuuZS1VdWVUHVtVh7d/bFuAp7e/0LpnYcGgPOL0K+DjNH9f5VXVVv1XN6enAS2j2wje3/57Td1EPIK8Gzk1yBbAG+N1+yxmubd1cAHwRuJLmb3hJ3e4hyQeAy4DHJdmS5JeAs4ETk1xHc1bN2X3WuN0Oav1TYB/g4vbv7C97LXLADuodzbaWdotJktSHiW05SJJ2zHCQJHUYDpKkDsNBktRhOEiSOgwHSVKH4SBJ6jAcpBFI8tT2eQB7JNmrff7CkX3XJc2XF8FJI5Lkd4A9gIfS3Lvp93ouSZo3w0EakfaeXV8A/gM4tqru6bkkad7sVpJG5wBgb5r79CypR3lKO2PLQRqRJBfR3LL8cODgqnpVzyVJ87as7wKkB6IkLwVmqur97fPKP5PkmVX1qb5rk+bDloMkqcNjDpKkDsNBktRhOEiSOgwHSVKH4SBJ6jAcJEkdhoMkqeP/A+JrWvX7x8vFAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# 绘制观测数据的聚类情况\n",
    "for i in range(0, len(labels)):\n",
    "    if labels[i] == 0:\n",
    "        plt.scatter(i, data.take(i), s=15, c='red')\n",
    "    elif labels[i] == 1:\n",
    "        plt.scatter(i, data.take(i), s=15, c='blue')\n",
    "plt.title('Gaussian Mixture Model')\n",
    "plt.xlabel('x')\n",
    "plt.ylabel('y')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第4步：自编程实现高斯混合模型的EM算法**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import itertools\n",
    "\n",
    "\n",
    "class MyGMM:\n",
    "    def __init__(self, alphas_init, means_init, covariances_init, tol=1e-6, n_components=2, max_iter=50):\n",
    "        # (1)设置参数的初始值\n",
    "        # 分模型权重\n",
    "        self.alpha_ = np.array(\n",
    "            alphas_init, dtype=\"float16\").reshape(n_components, 1)\n",
    "        # 分模型均值\n",
    "        self.mean_ = np.array(\n",
    "            means_init, dtype=\"float16\").reshape(n_components, 1)\n",
    "        # 分模型标准差（方差的平方）\n",
    "        self.covariances_ = np.array(\n",
    "            covariances_init, dtype=\"float16\").reshape(n_components, 1)\n",
    "        # 迭代停止的阈值\n",
    "        self.tol = tol\n",
    "        # 高斯混合模型分量个数\n",
    "        self.K = n_components\n",
    "        # 最大迭代次数\n",
    "        self.max_iter = max_iter\n",
    "        # 观测数据\n",
    "        self._y = None\n",
    "        # 实际迭代次数\n",
    "        self.n_iter_ = 0\n",
    "\n",
    "    def gaussian(self, mean, convariances):\n",
    "        \"\"\"计算高斯分布概率密度\"\"\"\n",
    "        return 1 / np.sqrt(2 * np.pi * convariances) * np.exp(\n",
    "            -(self._y - mean) ** 2 / (2 * convariances))\n",
    "\n",
    "    def update_r(self, mean, convariances, alpha):\n",
    "        \"\"\"更新r_jk 分模型k对观测数据yi的响应度\"\"\"\n",
    "        r_jk = alpha * self.gaussian(mean, convariances)\n",
    "        return r_jk / r_jk.sum(axis=0)\n",
    "\n",
    "    def update_params(self, r):\n",
    "        \"\"\"更新u al si 每个分模型k的均值、权重、方差的平方\"\"\"\n",
    "        u = self.mean_[-1]\n",
    "        _mean = ((r * self._y).sum(axis=1) / r.sum(axis=1)).reshape(self.K, 1)\n",
    "        _covariances = ((r * (self._y - u) ** 2).sum(axis=1) /\n",
    "                        r.sum(axis=1)).reshape(self.K, 1)\n",
    "        _alpha = (r.sum(axis=1) / self._y.size).reshape(self.K, 1)\n",
    "        return _mean, _covariances, _alpha\n",
    "\n",
    "    def judge_stop(self, mean, covariances, alpha):\n",
    "        \"\"\"中止条件判断\"\"\"\n",
    "        a = np.linalg.norm(self.mean_ - mean)\n",
    "        b = np.linalg.norm(self.covariances_ - covariances)\n",
    "        c = np.linalg.norm(self.alpha_ - alpha)\n",
    "        return True if np.sqrt(a ** 2 + b ** 2 + c ** 2) < self.tol else False\n",
    "\n",
    "    def fit(self, y):\n",
    "        self._y = np.copy(np.array(y))\n",
    "        \"\"\"迭代训练获得预估参数\"\"\"\n",
    "        # (2)E步：计算分模型k对观测数据yi的响应度\n",
    "        r = self.update_r(self.mean_, self.covariances_, self.alpha_)\n",
    "        # 更新r_jk 分模型k对观测数据yi的响应度\n",
    "        _mean, _covariances, _alpha = self.update_params(r)\n",
    "        # 更新u al si 每个分模型k的均值、权重、方差的平方\n",
    "        for i in range(self.max_iter):\n",
    "            if not self.judge_stop(_mean, _covariances, _alpha):\n",
    "                # (4)未达到阈值条件，重复迭代\n",
    "                r = self.update_r(_mean, _covariances, _alpha)\n",
    "                # (3)M步：计算新一轮迭代的模型参数\n",
    "                _mean, _covariances, _alpha = self.update_params(r)\n",
    "            else:\n",
    "                # 达到阈值条件，停止迭代\n",
    "                self.n_iter_ = i\n",
    "                break\n",
    "\n",
    "            self.mean_ = _mean\n",
    "            self.covariances_ = _covariances\n",
    "            self.alpha_ = _alpha\n",
    "\n",
    "    def score(self):\n",
    "        \"\"\"计算该局部最优解的score，即似然函数值\"\"\"\n",
    "        return (self.alpha_ * self.gaussian(self.mean_, self.covariances_)).sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "alpha : [[0.56950675 0.43049325]]\n",
      "mean : [[27.41762854 12.35515017]]\n",
      "std : [[ 268.17311145 2772.33989897]]\n"
     ]
    }
   ],
   "source": [
    "# 观测数据\n",
    "y = np.array([-67, -48, 6, 8, 14, 16, 23, 24, 28,\n",
    "             29, 41, 49, 56, 60, 75]).reshape(1, 15)\n",
    "# 预估均值和方差，以其邻域划分寻优范围\n",
    "y_mean = y.mean() // 1\n",
    "y_std = (y.std() ** 2) // 1\n",
    "\n",
    "# 网格搜索，对不同的初值进行参数估计\n",
    "alpha = [[i, 1 - i] for i in np.linspace(0.1, 0.9, 9)]\n",
    "mean = [[y_mean + i, y_mean + j]\n",
    "        for i in range(-10, 10, 5) for j in range(-10, 10, 5)]\n",
    "covariances = [[y_std + i, y_std + j]\n",
    "               for i in range(-1000, 1000, 500) for j in range(-1000, 1000, 500)]\n",
    "results = []\n",
    "for i in itertools.product(alpha, mean, covariances):\n",
    "    init_alpha = i[0]\n",
    "    init_mean = i[1]\n",
    "    init_covariances = i[2]\n",
    "    clf = MyGMM(alphas_init=init_alpha, means_init=init_mean, covariances_init=init_covariances,\n",
    "                n_components=2, tol=1e-6)\n",
    "    clf.fit(y)\n",
    "    # 得到不同初值收敛的局部最优解\n",
    "    results.append([clf.alpha_, clf.mean_, clf.covariances_, clf.score()])\n",
    "# 根据score，从所有局部最优解找到相对最优解\n",
    "best_value = max(results, key=lambda x: x[3])\n",
    "\n",
    "print(\"alpha : {}\".format(best_value[0].T))\n",
    "print(\"mean : {}\".format(best_value[1].T))\n",
    "print(\"std : {}\".format(best_value[2].T))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 习题9.4\n",
    "&emsp;&emsp;EM算法可以用到朴素贝叶斯法的非监督学习，试写出其算法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：**  \n",
    "参考： http://www.cs.columbia.edu/~mcollins/em.pdf\n",
    "\n",
    "1. 列出EM算法；\n",
    "2. 列出朴素贝叶斯算法；\n",
    "3. 推导朴素贝叶斯的EM算法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：EM算法**\n",
    "\n",
    "&emsp;&emsp;根据书中第178页EM算法：\n",
    "> 输入：观测变量数据$Y$，隐变量数据$Z$，联合分布$P(Y,Z|\\theta)$，条件分布$P(Z|Y,\\theta)$；  \n",
    "输出：模型参数$\\theta$。  \n",
    "（1）选择参数的初值$\\theta^{(0)}$，开始迭代；  \n",
    "（2）E步：记$\\theta^{(i)}$为第$i$次迭代参数$\\theta$的估计值，在第$i+1$次迭代的E步，计算\n",
    "> $$\n",
    "\\begin{aligned}\n",
    "Q(\\theta,\\theta^{(i)}) &= E_Z[\\log P(Y,Z | \\theta)| Y,\\theta^{(i)}] \\\\\n",
    "&= \\sum_z \\log P(Y,Z | \\theta) P(Z|Y,\\theta^{(i)})\n",
    "\\end{aligned}\n",
    "$$\n",
    "> 这里，$P(Z|Y, \\theta)$是在给定观测数据$Y$和当前的参数估计$\\theta^{(i)}$下隐变量数据$Z$的条件概率分布；  \n",
    "（3）M步：求使$Q(\\theta, \\theta^{(i)})$极大化的$\\theta$，确定第$i+1$次迭代的参数的估计值$\\theta^{(i+1)}$\n",
    "> $$\n",
    "\\theta^{(i+1)} = \\arg \\max \\limits_{\\theta} Q(\\theta, \\theta^{(i)})\n",
    "$$\n",
    ">（4）重复第（2）步和第（3）步，直至收敛。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：朴素贝叶斯算法**\n",
    "\n",
    "&emsp;&emsp;根据书中第62页朴素贝叶斯算法：\n",
    "> 输入：训练数据$T={(x_1, y_1), (x_2, y_2), \\cdots, (x_N, y_N)}$，其中$x_i=(x_i^{(1)}, x_i^{(2)}, \\cdots, x_i^{(n)})^T$，$x_i^{(j)}$是第$i$个样本的第$j$个特征，$x_i^{(j)} \\in \\{a_{j1}, a_{j2},\\cdots, a_{j S_j}\\}$，$a_{jl}$是第$j$个特征可能取的第$l$个值，$j=1,2,\\cdots, n$，$l=1,2,\\cdots, S_j$，$y_i \\in \\{ c_1, c_2, \\cdots, c_K\\}$；实例$x$；  \n",
    "输出：实例$x$的分类。  \n",
    "（1）计算先验概率及条件概率\n",
    "> $$\n",
    "P(Y=c_k) = \\frac{\\displaystyle \\sum_{i=1}^N I(y_i=c_k)}{N}, \\quad k=1,2,\\cdots, K \\\\\n",
    "P(X^{(j)}=a_{jl}|Y=c_k)= \\frac{\\displaystyle \\sum_{i=1}^N I(x_i^{(j)} = a_{jl}, y_i=c_k) }{\\displaystyle \\sum_{i=1}^N I(y_i=c_k)} \\\\\n",
    "j=1,2,\\cdots,n; \\quad l=1,2,\\cdots, S_j; \\quad k=1,2,\\cdots, K\n",
    "$$  \n",
    ">（2）对于给定的实例$x=(x^{(1)}, x^{(2)}, \\cdots, x^{(n)})^T$，计算\n",
    "> $$\n",
    "P(Y=c_k) \\prod_{j=1}^n P(X^{(j)}=x^{(j)} | Y=c_k), \\quad k=1,2,\\cdots,K\n",
    "$$  \n",
    ">（3）确定实例$x$的类\n",
    "> $$\n",
    "y = \\arg \\max \\limits_{c_k} P(Y=c_k) \\prod_{j=1}^n P(X^{(j)}=x^{(j)} | Y=c_k)\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：推导朴素贝叶斯的EM算法**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "推导思路：\n",
    "1. 假设隐变量数据是$y \\in \\mathcal{Y} = \\{c_1, c_2, \\cdots, c_K\\}$\n",
    "2. 设置初值，$P^{(0)}(Y=y) \\geqslant 0$和$P_j^{(0)}(X=x|Y=y) \\geqslant 0$，其中$j = 1,2,\\cdots, n$，满足\n",
    "$$\n",
    "\\sum_{y \\in \\mathcal{Y}} P^{(0)}(Y=y) = 1 \\\\\n",
    "\\sum_{x \\in \\{-1, +1\\}} P_j^{(0)}(X=x|Y=y)=1\n",
    "$$\n",
    "3. 根据概率公式，可知概率\n",
    "$$\n",
    "\\delta(y|i) = P(Y=y | X=x_i, \\theta^{(t)}) = \\frac\n",
    "{\\displaystyle P^{(t)}(Y=y) \\prod_{j=1}^n P_j^{(t)}(X=x_i^{(j)} | Y=y) }\n",
    "{\\displaystyle \\sum_{y \\in \\mathcal{Y}} P^{(t)}(Y=y) \\prod_{j=1}^n P_j^{(t)}(X=x_i^{(j)} | Y=y)}\n",
    "$$\n",
    "其中$\\theta$表示朴素贝叶斯模型中所有的参数向量\n",
    "4. 迭代更新参数\n",
    "$$\n",
    "P^{(t+1)}(Y=y) = \\frac{1}{N} \\sum_{i=1}^N \\delta(y | i) \\\\\n",
    "P_j^{(t+1)}(X=x_i^{(j)} | y) = \\frac\n",
    "{\\displaystyle \\sum_{i=1}^N P(X=x_i^{(j)})\\delta(y|i) }\n",
    "{\\displaystyle \\sum_{i=1}^N \\delta(y|i)}\n",
    "$$\n",
    "5. 计算似然函数，得到使得似然函数最大的$\\theta$，重复第3步和第4步，直至收敛\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\theta^* &= \\arg \\max \\limits_{\\theta \\in \\Omega} L(\\theta) \\\\\n",
    "&= \\arg \\max \\limits_{\\theta \\in \\Omega} \\sum_{i=1}^N \\sum_{y \\in \\mathcal{Y}} \\delta(y|i) \\log \\left(P(Y=y) \\prod_{j=1}^n P_j (X=x_i^{(j)} | Y=y)\\right)\n",
    "\\end{aligned}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "所以，朴素贝叶斯的EM算法如下：\n",
    "\n",
    "输入：隐变量数据是$y \\in \\mathcal{Y} = \\{c_1, c_2, \\cdots, c_K\\}$，$x \\in \\mathcal{X} = (x_1, x_2, \\cdots, x_N)$，输入空间$\\mathcal{X} \\subset R^n$为$n$维向量的集合，$x=(x^{(1)}, x^{(2)}, \\cdots, x^{(n)})^T$，$x^{(i)}$取值范围是$\\{-1, +1\\}$；  \n",
    "输出：参数$P^{(t+1)}(Y=y)$，$P_j^{(t+1)}(X=x_i^{(j)} | y)$；  \n",
    "（1）选择参数的初值$P^{(0)}(Y=y) \\geqslant 0$和$P_j^{(0)}(X=x|Y=y) \\geqslant 0$，开始迭代；  \n",
    "（2）E步：记$\\theta^{(t)}$为第$t$次迭代参数$\\theta$的估计值，在第$t+1$次迭代的E步，计算 $$\n",
    "\\delta(y|i) = P(Y=y | X=x_i, \\theta^{(t)}) = \\frac\n",
    "{\\displaystyle P^{(t)}(Y=y) \\prod_{j=1}^n P_j^{(t)}(X=x_i^{(j)} | Y=y) }\n",
    "{\\displaystyle \\sum_{y \\in \\mathcal{Y}} P^{(t)}(Y=y) \\prod_{j=1}^n P_j^{(t)}(X=x_i^{(j)} | Y=y)}\n",
    "$$ \n",
    "（3）M步：求使$Q(\\theta, \\theta^{(t)})$极大化的$\\theta$，确定第$t+1$次迭代的参数的估计值\n",
    "$$\n",
    "P^{(t+1)}(Y=y) = \\frac{1}{N} \\sum_{i=1}^N \\delta(y | i) \\\\\n",
    "P_j^{(t+1)}(X=x_i^{(j)} | y) = \\frac\n",
    "{\\displaystyle \\sum_{i=1}^N P(X=x_i^{(j)})\\delta(y|i) }\n",
    "{\\displaystyle \\sum_{i=1}^N \\delta(y|i)}\n",
    "$$\n",
    "（4）重复第（2）步和第（3）步，直至收敛。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.11"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
